©2017 JD Watrous, M Henglin, B Claggett, S Cheng, M Jain, Brigham and Women’s Hospital and UCSD, all rights reserved.

citation: TBA

# Data manipulation
library(magrittr)
library(purrr)
library(dplyr)
library(tidyr)

# Plotting
library(ggplot2)

# Shape-based clustering
library(dtwclust)

We create some sample data here. While we will be using real data (loaded in the background) for the rest of this documentation, the data generated below provides an example of the structure of our data.

plate <-
  1:10

well <-
  1:10

metaboliteID <-
  runif(10, 100, 900)

while (length(unique(metaboliteID)) != length(metaboliteID)) {
  metaboliteID <-
    runif(10, 100, 900)
}

dat <-
  expand.grid(plate, well, metaboliteID) %>%
  setNames(c('plate', 'well', 'metaboliteID'))


dat %<>%
  mutate(plateWellID = paste0(plate, '_', well),
         value = runif(n(), 1000, 1000000)) %>%
  tbl_df()

dat
## # A tibble: 1,000 × 5
##    plate  well metaboliteID plateWellID      value
##    <int> <int>        <dbl>       <chr>      <dbl>
## 1      1     1      460.749         1_1   5725.919
## 2      2     1      460.749         2_1   7731.122
## 3      3     1      460.749         3_1 634977.748
## 4      4     1      460.749         4_1 889502.918
## 5      5     1      460.749         5_1  66150.159
## 6      6     1      460.749         6_1  18650.618
## 7      7     1      460.749         7_1 561326.297
## 8      8     1      460.749         8_1 562686.695
## 9      9     1      460.749         9_1 852415.995
## 10    10     1      460.749        10_1 931310.006
## # ... with 990 more rows

For our sample data, the columns are:

Once again, this is data generated only to give an idea of what our data looks like. It is not used beyond this point.

We measure samples on 96-well plates. For a given plate, wells are run one at a time, before moving onto the next plate. If a metabolite is unmeasured across an entire plate, we label that plate as misaligned. We then aggregate those results to calculate the mean misalignment score for a metabolite across the entire dataset.

misalign <- 
  dat %>%
  group_by(metaboliteID, plate) %>% 
  summarise(misaligned = all(is.na(value))) %>% 
  ungroup() %>% 
  group_by(metaboliteID) %>%
  mutate(misaligned = mean(misaligned)) %>%
  ungroup()


dat %<>% left_join(misalign, by = c('metaboliteID', 'plate'))

Here we show two plots, one a smaller subset of the other to show greater detail. Text has been removed to mask data identifiers. Here we can evaluate misalignment by plotting the measured values versus the chronological order by which individual samples were run (the plateWellID is structured to be plotted in chronological order in this case.) We can look for signs of excess or unusual signal variation over chronological time.

Note; an expected warning will appear to signify that points with NA, representing a metabolite that was not measured in a particular sample, are being omitted from the plot.

metabsToPlot <-
  dat %>%
  distinct(metaboliteID) %>%
  sample_n(100)

blankThm <-
  theme(
    text = element_blank(),
    strip.background = element_blank(),
    axis.ticks = element_blank()
  )

dat %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  blankThm
## Warning: Removed 359384 rows containing missing values (geom_point).

dat %>%
  inner_join(metabsToPlot, by = 'metaboliteID') %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  blankThm
## Warning: Removed 27965 rows containing missing values (geom_point).

One option that can be applied to make these plots more useful is to cluster the plots based on the ‘shape’ of each graph. This can be useful because we expect certain types of sources of ‘drift’ (i.e. signal variation over time) to generate certain types of shapes of the data when plotted as a time series. The main caveat, when considering this option, is that calculating the clusters can be time-intensive with respect to computation. For example, for the dataset generating these plots, the computational processing time is roughly 5-10 minutes. Below, we cluster using the dtwclust package. This package offers a variety of clustering methods and parameters. Here, we will use hierarchical clustering and generate 10 clusters.

# Format the data for the clustering function
tsdat <-
  dat %>%
  select(metaboliteID, value) %>%
  split(.$metaboliteID) %>%
  map(function(x) {
    x[, 'value']
  })

nms <- names(tsdat)

tsdat %<>%
  bind_cols %>%
  set_names(nms) %>%
  as.data.frame %>%
  t()

# Here, we replace NA's with zero. Recognizing this step can be important for understanding why certain metabolites may end up clustered together, especially when trying to understand why a mostly misalgned metabolite might be assigned to a specific cluster.
tsdat[is.na(tsdat)] <- 0


# Start the clock!
ptm <- proc.time()

# Compute Hierarchical clusters
clusts <-
  dtwclust(tsdat,
           k = 10, # Number of clusters
           type = 'hierarchical', # Clustering technique
           method = 'ward.D2', # linkage method
           distance = 'sbd', # Shape based distance
           preproc = zscore)

# Get the cluster labels
clusterLabels <-
  data.frame(metaboliteID = nms,
             label =  cutree(clusts, k = 10))


# Stop the clock
ptmf <- proc.time() - ptm
# All these factors influence cluster time.
# Clustering Time (Elapsed, Seconds)
ptmf
##    user  system elapsed 
## 267.329  70.189 339.480
# Dimensions of ouor data (row, column)
dim(dat)
## [1] 1409694      11
# Number of unique metabolites
length(unique(dat$metaboliteID))
## [1] 1166

When we recreate the plots using these labels, we see that like metabolites are now adjacent to one another. While not perfect, clusters of similar shapes can be identified. Here, coloured backgrounds are used to indicate which metabolites share the same cluster.

# For coloured backgrounds
rectangles <-
  dat %>%
  group_by(metaboliteID) %>%
  arrange(plateWellID) %>%
  summarise(xmin = plateWellID[1],
            xmax = plateWellID[n()],
            ymin = min(value, na.rm = T),
            ymax = max(value, na.rm = T)) %>%
  left_join(clusterLabels) %>%
  mutate(label = as.factor(label))
dat %>%
  left_join(clusterLabels, by  = 'metaboliteID') %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ label + metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = label), 
            alpha = 0.3,
            data = rectangles) +
  blankThm

dat %>%
  inner_join(metabsToPlot, by = 'metaboliteID') %>%
  left_join(clusterLabels, by  = 'metaboliteID') %>%
  # mutate(plateWellID = as.numeric(as.factor(plateWellID))) %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ label + metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = label), 
            alpha = 0.3,
            data = inner_join(rectangles, metabsToPlot, by = 'metaboliteID')) +
  blankThm

It may also be of interest to transform the data before clutering and plotting it, to see if other relationships can be revealed. Here, we reorder the data so that plates with the smallest mean value will be plotted first, while plates with the largest mean value will be plotted last. Within plates, points are plotted chronologically. This step can be helpful for shape-based clustering, when applied, to more specifically distinguish between metabolites more or less affected by certain types of sources of signal variation over time. For instance, plate re-ordered data may in some datasets allow for easier distinctions to be made between misalignment or batch effect (wherein step up patterns are more common with or without alterations in plate-specific dispersion or variance) versus more subtle instrument drift or contaminant; such re-ordering may also facilite shape based recognition of chemical instability patterns.

# Format the data for the clustering function
tsdat <-
  dat %>%
  select(plate, well, metaboliteID, value) %>%
  split(.$metaboliteID) %>%
  map(function(x) {
    x %<>%
      group_by(plate) %>%
      mutate(plateMean = mean(value, na.rm = T)) %>%
      ungroup() %>%
      arrange(plateMean, well) %>%
      select(value)
    
    x[, 'value']
  })

nms <- names(tsdat)

tsdat %<>%
  bind_cols %>%
  set_names(nms) %>%
  as.data.frame %>%
  t()

# Here, we replace NA's with zero. Recognizing this step can be important for understanding why certain metabolites may end up clustered together, especially when trying to understand why a mostly misalgned metabolite might be assigned to a specific cluster.
tsdat[is.na(tsdat)] <- 0

# Compute Hierarchical clusters
clusts <-
  dtwclust(tsdat,
           k = 10, # Number of clusters
           type = 'hierarchical', # Clustering technique
           method = 'ward.D2', # linkage method
           distance = 'sbd', # Shape based distance
           preproc = zscore)

# Get the cluster labels
clusterLabels <-
  data.frame(metaboliteID = nms,
             label =  cutree(clusts, k = 10))
# For coloured backgrounds
pmoDat <-
  dat %>%
  group_by(metaboliteID, plate) %>%
  mutate(plateMean = mean(value, na.rm = T)) %>%
  ungroup() %>%
  group_by(metaboliteID) %>%
  arrange(plateMean, well) %>%
  mutate(plateWellID = 1:n())

rectangles <-
  pmoDat %>%
  group_by(metaboliteID) %>%
  arrange(plateWellID) %>%
  summarise(xmin = plateWellID[1],
            xmax = plateWellID[n()],
            ymin = min(value, na.rm = T),
            ymax = max(value, na.rm = T)) %>%
  left_join(clusterLabels) %>%
  mutate(label = as.factor(label))
pmoDat %>%
  left_join(clusterLabels, by  = 'metaboliteID') %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ label + metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = label), 
            alpha = 0.3,
            data = rectangles) +
  blankThm

pmoDat %>%
  inner_join(metabsToPlot, by = 'metaboliteID') %>%
  left_join(clusterLabels, by  = 'metaboliteID') %>%
  # mutate(plateWellID = as.numeric(as.factor(plateWellID))) %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ label + metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = label), 
            alpha = 0.3,
            data = inner_join(rectangles, metabsToPlot, by = 'metaboliteID')) +
  blankThm

Another transform would involve removing the top 5% of values from each metabolite before plotting. We observed very high values in our data, and including these values obscures the detail contained in the majority of the data points. Thus, filtering out these values off can reveal more useful information about the structure of our data.

# Format the data for the clustering function
tsdat <-
  dat %>%
  select(plate, well, metaboliteID, value) %>%
  split(.$metaboliteID) %>%
  map(function(x) {
    x %<>%
      group_by(plate) %>%
      mutate(plateMean = mean(value, na.rm = T)) %>%
      ungroup() %>%
      arrange(plateMean, well) %>%
      mutate(value = ifelse(value <= quantile(value, probs = 0.95, na.rm = T), value, NA)) %>%
      select(value)
    
    x[, 'value']
  })

nms <- names(tsdat)

tsdat %<>%
  bind_cols %>%
  set_names(nms) %>%
  as.data.frame %>%
  t()

# Here, we replace NA's with zeros. Recognizing this step can be important for understanding why certain metabolites may end up clustered together, especially when trying to understand why a mostly misalgned metabolite might be assigned to a specific cluster.
tsdat[is.na(tsdat)] <- 0

# Compute Hierarchical clusters
clusts <-
  dtwclust(tsdat,
           k = 10, # Number of clusters
           type = 'hierarchical', # Clustering technique
           method = 'ward.D2', # linkage method
           distance = 'sbd', # Shape based distance
           preproc = zscore)

# Get the cluster labels
clusterLabels <-
  data.frame(metaboliteID = nms,
             label =  cutree(clusts, k = 10))
# For coloured backgrounds
pmotDat <-
  dat %>%
  group_by(metaboliteID, plate) %>%
  mutate(plateMean = mean(value, na.rm = T)) %>%
  ungroup() %>%
  group_by(metaboliteID) %>%
  mutate(value = ifelse(value <= quantile(value, probs = 0.95, na.rm = T), value, NA)) %>%
  arrange(plateMean, well) %>%
  mutate(plateWellID = 1:n())

rectangles <-
  pmotDat %>%
  group_by(metaboliteID) %>%
  arrange(plateWellID) %>%
  summarise(xmin = plateWellID[1],
            xmax = plateWellID[n()],
            ymin = min(value, na.rm = T),
            ymax = max(value, na.rm = T)) %>%
  left_join(clusterLabels) %>%
  mutate(label = as.factor(label))
pmotDat %>%
  left_join(clusterLabels, by  = 'metaboliteID') %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ label + metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = label), 
            alpha = 0.3,
            data = rectangles) +
  blankThm

pmotDat %>%
  inner_join(metabsToPlot, by = 'metaboliteID') %>%
  left_join(clusterLabels, by  = 'metaboliteID') %>%
  # mutate(plateWellID = as.numeric(as.factor(plateWellID))) %>%
  ggplot() +
  geom_point(aes(x = plateWellID, y = value, colour = misaligned)) +
  facet_wrap( ~ label + metaboliteID, scales = 'free_y') +
  scale_color_gradient(low = 'blue', high = 'red') +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax, fill = label), 
            alpha = 0.3,
            data = inner_join(rectangles, metabsToPlot, by = 'metaboliteID')) +
  blankThm